Method for data processing in a multi-processor data processing system and a corresponding data processing system

ABSTRACT

The invention is based on the idea of separating synchronisation operations from reading and writing operations. To this end, a method for data processing in a data processing system is provided. Said data processing system comprises a first and at least a second processor for processing streams of data objects, said first processor passing data objects from a stream of data objects to the second processor. Said data processing system further comprises at least one memory for storing and retrieving data objects, to which said first and second processors have shared access. The processors perform read operations and/or write operations in order to exchange data objects with said memory. Said processors further perform inquiry operations and/or commit operations in order to synchronise data object transfers between tasks which are executed by said processors. Said processors perform said inquiry operations and said commit operations independently of said read operations and said write operations.

The invention relates to a method for data processing in a multi-processor data processing system and to a corresponding data processing system having multiple processors.

A heterogeneous multiprocessor architecture for high-performance, data-dependent media processing, e.g. for high-definition MPEG decoding, is known. Media processing applications can be specified as a set of concurrently executing tasks that exchange information solely by unidirectional streams of data. G. Kahn introduced a formal model of such applications as early as 1974, ‘The Semantics of a Simple Language for Parallel Programming’, Proc. of the IFIP Congress 74, August 5-10, Stockholm, Sweden, North-Holland Publ. Co., 1974, pp. 471-475. Kahn and MacQueen followed it with an operational description in 1977, ‘Coroutines and Networks of Parallel Processes’, Information Processing 77, B. Gilchrist (Ed.), North-Holland Publ., 1977, pp. 993-998. This formal model is now commonly referred to as a Kahn Process Network.

An application is known as a set of concurrently executable tasks. Tasks can only exchange information by unidirectional streams of data. Tasks should communicate only deterministically, by means of read and write actions on predefined data streams. The data streams are buffered with FIFO behaviour. Because of this buffering, two tasks communicating through a stream do not have to synchronise on individual read or write actions.

In stream processing, successive operations on a stream of data are performed by different processors. For example, a first stream might consist of pixel values of an image, which a first processor processes to produce a second stream of blocks of DCT (Discrete Cosine Transformation) coefficients of 8×8 blocks of pixels. A second processor might process the blocks of DCT coefficients to produce a stream of blocks of selected and compressed coefficients for each block of DCT coefficients.

FIG. 1 illustrates the mapping of an application to a processor as known from the prior art. Data stream processing is realised by providing a number of processors. Each processor can perform a particular operation repeatedly, each time using data from the next data object of a stream of data objects and/or producing the next data object in such a stream. The streams pass from one processor to another, so that a second processor can process the stream produced by a first processor, and so on. One mechanism of passing data from a first to a second processor is to write the data blocks produced by the first processor into the memory.

The data streams in the network are buffered. Each buffer is realised as a FIFO with precisely one writer and one or more readers. Because of this buffering, the writer and readers do not need to synchronise individual read and write actions on the channel with each other. Reading from a channel with insufficient data available causes the reading task to stall. The processors can be dedicated hardware function units which are only weakly programmable. All processors run in parallel and each executes its own thread of control. Together they execute a Kahn-style application, where each task is mapped to a single processor. The processors allow multi-tasking, i.e. multiple Kahn tasks can be mapped onto a single processor.

It is an object of the invention to improve the operation of a Kahn-style data processing system.

This object is achieved by a method for processing data in the data processing system according to claim 1 and by the corresponding data processing system according to claim 11.

The invention is based on the idea of separating synchronisation operations from reading and writing operations. To this end, a method for data processing in a data processing system is provided. Said data processing system comprises a first and at least a second processor for processing streams of data objects, said first processor passing data objects from a stream of data objects to the second processor. Said data processing system further comprises at least one memory for storing and retrieving data objects, to which said first and second processors have shared access. The processors perform read operations and/or write operations in order to exchange data objects with said memory. Said processors further perform inquiry operations and/or commit operations in order to synchronise data object transfers between tasks which are executed by said processors. Said processors perform said inquiry operations and said commit operations independently of said read operations and said write operations.

This has the advantage that separating the synchronisation operations from the read and/or write operations leads to a more efficient implementation than the usual combination of the two. Furthermore, a single synchronisation operation can cover a series of read or write operations at once, which reduces how often synchronisation operations occur.
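The reduced synchronisation frequency can be shown with a toy count. The Python sketch below is only an illustration under assumptions. `CountingPort` and its methods are hypothetical stand-ins that count calls and move no real data. In the "combined" style every read carries its own inquiry and commit. In the "separated" style a single inquiry and a single commit cover a whole group of reads.

```python
class CountingPort:
    """Toy input port that only counts calls (hypothetical, for illustration)."""

    def __init__(self):
        self.sync_calls = 0
        self.read_calls = 0

    def get_space(self, n_bytes):
        self.sync_calls += 1
        return True  # assume data is always available in this toy

    def put_space(self, n_bytes):
        self.sync_calls += 1

    def read(self, offset, n_bytes):
        self.read_calls += 1
        return bytes(n_bytes)


def combined_style(port, n_objects, size):
    # Every read is bracketed by its own inquiry and commit.
    for _ in range(n_objects):
        port.get_space(size)
        port.read(0, size)
        port.put_space(size)


def separated_style(port, n_objects, size):
    # One inquiry and one commit cover the whole group of reads.
    port.get_space(n_objects * size)
    for i in range(n_objects):
        port.read(i * size, size)
    port.put_space(n_objects * size)


a, b = CountingPort(), CountingPort()
combined_style(a, 64, 8)
separated_style(b, 64, 8)
print(a.sync_calls, b.sync_calls)  # 128 2
```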

In a further aspect of the invention, one of said second processors executes said inquiry operations to request the right to access a group of data objects in said memory. Said group of data objects is produced or consumed in said memory by a series of read/write operations by said processors. Moreover, one of said second processors executes said commit operations to transfer the right to access said group of data objects to another of said second processors.

In a preferred aspect of the invention, said read/write operations enable said second processors to randomly access locations within one of said groups of data objects in said memory. Random access within one group of data objects in said memory opens up several useful opportunities, such as out-of-order processing of data and/or temporary storage of intermediate data through read and write memory accesses.

In a further preferred aspect of the invention, after the task has been interrupted, the actual task state of the partial processing of the group of data objects is discarded and commit operations on the partial group of data objects are prevented. This allows a task to be interrupted without the cost of saving its actual state.

In still a further preferred aspect of the invention, the processor restarts the processing of the group of data objects after the interrupted task resumes, and previous processing results on said group of data objects are discarded. This allows the processing of the complete group of data objects of the interrupted task to be restarted without state restore costs.

In a further aspect of the invention, a third processor receives the right to access a group of data objects from said first processor. It then performs read and/or write operations on said group of data objects and finally transfers the right of access to said second processor, without copying said group of data objects to another location in shared memory. This allows single data objects to be corrected or replaced.
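The hand-over of the access right through a third processor can be modelled as a ring of access points on one shared cyclic buffer. This Python sketch is a hedged toy model, not the patented mechanism. `RingStream`, its methods and the station indices are hypothetical. Each station may only touch the bytes granted to it. Committing `n` bytes passes exactly that right to the next station in data-flow order. The middle station therefore patches data in place and never copies it.

```python
class RingStream:
    """Toy cyclic buffer shared by k stations in data-flow order (hypothetical).

    Station 0 produces; each later station receives the right of access from
    the station before it and passes it on to the station after it. The last
    station hands the emptied room back to station 0.
    """

    def __init__(self, size, k):
        self.buf = bytearray(size)
        self.size = size
        self.pos = [0] * k        # current point of access of each station
        self.space = [0] * k      # bytes each station may access ahead of pos
        self.space[0] = size      # initially the producer owns the whole buffer

    def get_space(self, st, n):
        return n <= self.space[st]

    def write(self, st, offset, data):
        # Random access inside the granted window (caller must respect it).
        for i, b in enumerate(data):
            self.buf[(self.pos[st] + offset + i) % self.size] = b

    def read(self, st, offset, n):
        return bytes(self.buf[(self.pos[st] + offset + i) % self.size]
                     for i in range(n))

    def put_space(self, st, n):
        assert n <= self.space[st], "cannot commit more than was granted"
        self.pos[st] = (self.pos[st] + n) % self.size
        self.space[st] -= n
        self.space[(st + 1) % len(self.space)] += n  # right moves downstream


PRODUCER, FIXER, CONSUMER = 0, 1, 2
s = RingStream(size=8, k=3)
assert s.get_space(PRODUCER, 4)
s.write(PRODUCER, 0, b"abcd")
s.put_space(PRODUCER, 4)

# The middle station patches one byte in place; nothing is copied elsewhere.
assert s.get_space(FIXER, 4)
s.write(FIXER, 2, b"X")
s.put_space(FIXER, 4)

assert s.get_space(CONSUMER, 4)
print(s.read(CONSUMER, 0, 4))  # b'abXd'
```

In this model the space fields always sum to the buffer size. That is why the three access windows can never overlap.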

The invention also relates to a data processing system comprising a first and at least a second processor for processing streams of data objects, said first processor being arranged to pass data objects from a stream of data objects to the second processor; and at least one memory for storing and retrieving data objects, to which said first and said second processors have shared access. Said processors are adapted to perform read operations and/or write operations to exchange data objects with said memory. Said processors are further adapted to perform inquiry operations and/or commit operations to synchronise data object transfers between tasks which are executed by said processors. Said processors are adapted to perform said inquiry operations and said commit operations independently of said read operations and said write operations.

Further embodiments of the invention are described in the dependent claims.

These and other aspects of the invention are described in more detail with reference to the drawings, the figures showing:

FIG. 1 an illustration of the mapping of an application to a processor according to the prior art;

FIG. 2 a flow chart of the principal processing of a processor;

FIG. 3 a schematic block diagram of an architecture of a stream-based processing system according to a second embodiment;

FIG. 4 an illustration of the synchronising operation and an I/O operation in the system of FIG. 3;

FIG. 5 an illustration of a cyclic FIFO memory;

FIG. 6 a mechanism of updating local space values in each shell according to FIG. 3;

FIG. 7 an illustration of the FIFO buffer with a single writer and multiple readers; and

FIG. 8 a finite memory buffer implementation for a three-station stream.

The preferred embodiment of the invention refers to a multi-processor, stream-based data processing system, preferably comprising a CPU and several processors or coprocessors. The CPU passes data objects from a stream of data objects to one of the processors. The CPU and the processors are coupled to at least one memory via a bus. The CPU and the processors have shared access to this memory and use it for storing and retrieving data objects.

The processors perform read operations and/or write operations in order to exchange data objects with said memory. Said processors further perform inquiry operations and/or commit operations in order to synchronise data object transfers between tasks which are executed by said processors. Said processors perform said inquiry operations and said commit operations independently of said read operations and said write operations.

The synchronisation operations described above can be separated into inquiry operations and commit operations. An inquiry operation informs the processor either that data objects are available for a subsequent read operation or that room is available for a subsequent write operation. These two cases can also be realised by get_data operations and get_room operations, respectively. Once the processor has been notified of the available window or group of data, it can freely access that window or group of data objects in the buffer in any way it likes. The processor then performs the necessary processing on the group of data objects, or at least on a part of the data objects in said group or access window. After that, it can issue a commit signal to another processor indicating that data or room is newly available in the memory. It does so using put_data or put_room operations, respectively.

However, in the preferred embodiment these four synchronisation operations make no difference between handling data and handling room. It is therefore advantageous to merge them into single space operations. This leaves just two synchronisation operations, namely get_space for inquiry and put_space for commit.
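The reduction from four operations to two can be sketched for one writer and one reader. This is a hedged Python illustration, not the shell's real interface. `SpaceFifo` and the four alias functions are hypothetical names. For the writer, "space" means empty room. For the reader, it means valid data. A single get_space/put_space pair serves both.

```python
class SpaceFifo:
    """Single-writer, single-reader FIFO reduced to two synchronisation calls.

    Toy model: for the writer 'space' is empty room, for the reader it is
    valid data; both use the same get_space/put_space pair.
    """

    WRITER, READER = 0, 1

    def __init__(self, size):
        self.space = {self.WRITER: size, self.READER: 0}

    def get_space(self, port, n_bytes):
        # Inquiry: may `port` access n_bytes ahead of its point of access?
        return n_bytes <= self.space[port]

    def put_space(self, port, n_bytes):
        # Commit: hand n_bytes over to the other access point.
        if n_bytes > self.space[port]:
            raise ValueError("commit exceeds granted space")
        self.space[port] -= n_bytes
        self.space[1 - port] += n_bytes


# The four classic operations become thin aliases of the two generic ones.
def get_room(f, n):
    return f.get_space(SpaceFifo.WRITER, n)


def put_data(f, n):
    f.put_space(SpaceFifo.WRITER, n)


def get_data(f, n):
    return f.get_space(SpaceFifo.READER, n)


def put_room(f, n):
    f.put_space(SpaceFifo.READER, n)


f = SpaceFifo(16)
assert get_room(f, 10) and not get_data(f, 1)
put_data(f, 10)            # writer commits 10 bytes of data
assert get_data(f, 10) and not get_room(f, 7)
put_room(f, 4)             # reader releases 4 bytes of room
print(f.space)             # {0: 10, 1: 6}
```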

The processors explicitly decide at which points during a running task that task can be interrupted. A processor can continue until no processing resources, or only a restricted amount of them, are available to it. Such resources include enough incoming data and sufficient free space in the buffer memory. These points are the best opportunities for the processors to initiate task switching. The processor initiates task switching by issuing a call for the task to be processed next. The intervals between such calls for a next task can be defined as processing steps. A processing step may include reading one or more packets or groups of data, performing some operations on the acquired data and writing one or more packets or groups of data.

The overall system architecture does not define or enforce the concept of reading and writing packets or groups of data. The notion of packets or groups of data is not visible at the level of the generic infrastructure of the system architecture. Both the data transport operations and the synchronisation operations are designed to operate on unformatted byte streams. The data transport operations are the reading and writing of data from/into the buffer memory. The synchronisation operations signal the actual consumption of data between the reader and writer for the purpose of buffer management. The notion of packets or groups of data emerges only in the next layer of functionality in the system architecture, i.e. inside the processors which actually perform the media processing.

Each task running on the processor can be modelled as a repetition of processing steps, each of which attempts to process a packet or group of data. Before performing such a processing step, the task interacts with a task scheduler in said data processing system. This interaction determines with which task the processor is supposed to continue and provides an explicit task-switching moment.

FIG. 2 shows a flow chart of the general processing of the processor. In step S1 the processor issues a call for the next task to the task scheduler, in order to determine with which task it is supposed to continue. In step S2, the processor receives from the task scheduler the information about the next task to be processed. In step S3 the processor then checks the input streams belonging to that task, in order to decide whether sufficient data or other processing resources are available to perform the requested processing. This initial investigation may involve attempts to read some partial input and to decode packet headers. If step S4 determines that processing can continue because all necessary processing resources are at hand, the flow jumps to step S5 and the processor continues with processing the current task. After the processor has finished this processing in step S6, the flow jumps to the next processing step and the above-mentioned steps are repeated.

However, step S4 may determine that the processor cannot continue with the current task. In that case it cannot complete the current processing step because processing resources are insufficient, for example because data is lacking in one of the input streams. The flow is then forwarded to step S7, and all results of the partial processing done so far are discarded without any state saving. In other words, none of the partial results of this processing step are saved. The partial processing may include some synchronisation calls, data read operations, or some processing on the acquired data. In step S8 the flow is then directed to restart and fully re-do the unfinished processing step at a later stage. However, the current task can only be abandoned and its partial results discarded as long as it has not committed any of its stream actions by sending the synchronisation message.
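The S1 to S8 flow can be sketched in Python as a minimal illustration under assumptions, not the processor's actual control logic. The names `Fifo`, `processing_step` and `run` are hypothetical, as are the 4-byte packet and the XOR "processing". The sketch shows one property. Nothing is committed until the step has fully succeeded, so abandoning a step needs no state saving and the step is simply retried later.

```python
from collections import deque


class Fifo:
    """Minimal byte FIFO; `available` plays the role of the inquiry."""

    def __init__(self):
        self.data = deque()

    def available(self):
        return len(self.data)


def processing_step(inp, out, packet=4):
    """One processing step (S3-S6 of FIG. 2), sketched under assumptions.

    Returns True if the step completed and was committed, False if it was
    abandoned. Nothing is committed before the end, so abandoning simply
    drops the local partial results.
    """
    # S3/S4: inspect the input before committing to anything.
    if inp.available() < packet:
        return False                                  # S7: discard, no state saved
    partial = [inp.data[i] for i in range(packet)]    # read without consuming
    result = [b ^ 0xFF for b in partial]              # S5: placeholder processing
    # S6: commit only at the very end; from here on the step cannot be abandoned.
    for _ in range(packet):
        inp.data.popleft()
    out.data.extend(result)
    return True


def run(tasks, rounds):
    """S1/S2: a trivial round-robin scheduler picking the next task."""
    log = []
    for r in range(rounds):
        name, (inp, out) = tasks[r % len(tasks)]
        log.append((name, processing_step(inp, out)))  # S8: retried next time
    return log


a_in, a_out = Fifo(), Fifo()
a_in.data.extend([1, 2])            # not enough for a 4-byte packet yet
log = run([("A", (a_in, a_out))], 1)
a_in.data.extend([3, 4])            # more data arrives
log += run([("A", (a_in, a_out))], 1)
print(log, list(a_out.data))        # [('A', False), ('A', True)] [254, 253, 252, 251]
```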

Especially in function-specific hardware processors, removing the need to support saving and restoring intermediate state can simplify their design and reduce their required silicon area.

FIG. 3 shows a processing system for processing streams of data objects according to a second embodiment of the invention. The system can be divided into different layers, namely a computation layer 1, a communication support layer 2 and a communication network layer 3. The computation layer 1 includes a CPU 11 and two processors 12 a, 12 b. This is merely an example, and more processors may be included in the system. The communication support layer 2 comprises a shell 21 associated with the CPU 11 and shells 22 a, 22 b associated with the processors 12 a, 12 b, respectively. The communication network layer 3 comprises a communication network 31 and a memory 32.

The processors 12 a, 12 b are preferably dedicated processors, each specialised to perform a limited range of stream processing. Each processor is arranged to apply the same processing operation repeatedly to successive data objects of a stream. The processors 12 a, 12 b may each perform a different task or function, e.g. variable-length decoding, run-length decoding, motion compensation, image scaling or a DCT transformation. In operation, each processor 12 a, 12 b executes operations on one or more data streams. An operation may receive a stream and generate another stream, receive a stream without generating a new one, generate a stream without receiving one, or modify a received stream. The processors 12 a, 12 b can process data streams generated by the other processor 12 b, 12 a, by the CPU 11, or even by themselves. A stream comprises a succession of data objects which are transferred from and to the processors 12 a, 12 b via said memory 32.

The shells 22 a, 22 b comprise a first interface towards the communication network layer, which is a communication layer. This first interface is uniform or generic for all the shells. The shells 22 a, 22 b also comprise a second interface towards the processor 12 a, 12 b with which each is associated. The second interface is a task-level interface. It is customised towards the associated processor 12 a, 12 b so that it can handle the specific needs of that processor. The shells 22 a, 22 b thus have a processor-specific second interface. The overall architecture of the shells is nevertheless generic and uniform for all processors, which makes it easy to re-use the shells in the overall system architecture while allowing parameterisation and adaptation for specific applications.

The shells 22 a, 22 b each comprise a reading/writing unit for data transport, a synchronisation unit and a task switching unit. These three units communicate with the associated processor on a master/slave basis, with the processor acting as master. Accordingly, each of the three units is activated by a request from the processor. Preferably, the communication between the processor and the three units is implemented by a request-acknowledge handshake mechanism, which hands over argument values and waits for the requested values to return. The communication is therefore blocking, i.e. the respective thread of control waits for its completion.

The reading/writing unit preferably implements two different operations. The read operation enables the processors 12 a, 12 b to read data objects from the memory 32. The write operation enables the processors 12 a, 12 b to write data objects into the memory 32. Each task has a predefined set of ports which correspond to the attachment points for the data streams. These operations take three arguments: the ID of the respective port, ‘port_id’; the offset at which the reading/writing should take place, ‘offset’; and the variable length of the data objects, ‘n_bytes’. The ‘port_id’ argument selects the port. It is a small non-negative number whose scope is local to the current task.
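The local scope of ‘port_id’ can be illustrated with a small Python sketch. It assumes a hypothetical per-task port table and is not the shell's actual data structure. A port pairs a shared stream buffer with that task's own point of access, and ‘offset’ is counted from that point. For write, the length of `data` stands in for ‘n_bytes’. Cyclic wrap-around is left out here for brevity.

```python
class Port:
    """One attachment point of a task to a stream (hypothetical layout)."""

    def __init__(self, buffer):
        self.buffer = buffer        # bytearray shared with the task at the other end
        self.access_point = 0       # this task's own current point of access


class Task:
    def __init__(self, buffers):
        # port_id is simply the index into this task's own port list (local scope).
        self.ports = [Port(b) for b in buffers]

    def _span(self, port_id, offset, n_bytes):
        p = self.ports[port_id]
        start = p.access_point + offset   # offset is relative to the point of access
        if offset < 0 or start + n_bytes > len(p.buffer):
            raise IndexError("outside this toy (non-cyclic) buffer")
        return p.buffer, start

    def read(self, port_id, offset, n_bytes):
        buf, start = self._span(port_id, offset, n_bytes)
        return bytes(buf[start:start + n_bytes])

    def write(self, port_id, offset, data):
        buf, start = self._span(port_id, offset, len(data))
        buf[start:start + len(data)] = data


stream = bytearray(32)
producer = Task([stream])                  # the stream is port 0 of the producer ...
consumer = Task([bytearray(8), stream])    # ... but port 1 of the consumer
producer.write(0, 4, b"hello")
print(consumer.read(1, 4, 5))              # b'hello'
```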

The synchronisation unit implements two synchronisation operations to handle local blocking conditions when reading from an empty FIFO or writing to a full FIFO. The first, the getspace operation, is a request for space in the memory implemented as a FIFO. The second, the putspace operation, is a request to release space in the FIFO. The arguments of these operations are ‘port_id’ and the variable length ‘n_bytes’.

The getspace and putspace operations synchronise on a linear tape, i.e. in FIFO order. Inside the window acquired by these operations, random-access read/write actions are supported.

The task switching unit implements the task switching of the processor as a gettask operation. The arguments of this operation are ‘blocked’, ‘error’ and ‘task_info’.

The argument ‘blocked’ is a Boolean value. It is set true if the last processing step could not be completed because a getspace call on an input port or an output port returned false. The task scheduling unit thus learns quickly that this task should not be rescheduled until a new ‘space’ message arrives for the blocked port. This argument value is only advice that improves scheduling, and it never affects functionality. The argument ‘error’ is a Boolean value which is set true if a fatal error occurred inside the coprocessor during the last processing step. Examples from MPEG decoding are unknown variable-length codes or illegal motion vectors. In that case the shell clears the task table enable flag to prevent further scheduling, and an interrupt is sent to the main CPU to repair the system state. The current task will definitely not be scheduled again until the CPU intervenes through software.
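The handling of ‘blocked’ and ‘error’ in gettask can be sketched as a small task table. This is a hedged Python model and not the shell's actual scheduler. `TaskTable`, `space_message` and the round-robin policy are assumptions. ‘task_info’ is accepted but ignored, and the interrupt to the CPU is represented by appending to a list. ‘blocked’ is treated only as a hint. If every enabled task looks blocked, one of them is still chosen.

```python
class TaskTable:
    """Toy task-switching unit answering gettask(blocked, error, task_info)."""

    def __init__(self, n_tasks):
        self.enabled = [True] * n_tasks   # cleared on fatal error
        self.blocked = [False] * n_tasks  # scheduling hint only
        self.current = 0
        self.interrupts = []              # stand-in for interrupts to the CPU

    def space_message(self, task):
        # A new 'space' message for the blocked port makes the task eligible again.
        self.blocked[task] = False

    def _pick(self, t, need_unblocked):
        n = len(self.enabled)
        for step in range(1, n + 1):      # round robin, starting after task t
            cand = (t + step) % n
            if self.enabled[cand] and not (need_unblocked and self.blocked[cand]):
                self.current = cand
                return cand
        return None

    def gettask(self, blocked, error, task_info=None):
        t = self.current
        if error:
            self.enabled[t] = False       # never rescheduled until the CPU intervenes
            self.interrupts.append(t)
        elif blocked:
            self.blocked[t] = True
        # Prefer unblocked tasks; 'blocked' is advice, so fall back to any enabled one.
        choice = self._pick(t, need_unblocked=True)
        return choice if choice is not None else self._pick(t, need_unblocked=False)


tt = TaskTable(3)
picks = [tt.gettask(blocked=True, error=False),    # task 0 blocked on a port
         tt.gettask(blocked=False, error=True),    # task 1 hit a fatal error
         tt.gettask(blocked=False, error=False)]   # 0 still blocked, 1 disabled
tt.space_message(0)                                # space arrives for task 0
picks.append(tt.gettask(blocked=False, error=False))
print(picks, tt.interrupts)                        # [1, 2, 2, 0] [1]
```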

The operations just described are initiated by read calls, write calls, getspace calls, putspace calls or gettask calls from the processor.

FIG. 4 illustrates the process of reading and writing and its associated synchronisation operations. From the processor's point of view, a data stream looks like an infinite tape of data with a current point of access. The getspace call issued by the processor asks permission to access a certain data space ahead of the current point of access, as depicted by the small arrow in FIG. 4 a. If permission is granted, the processor can perform read and write actions inside the requested space, i.e. the framed window in FIG. 4 b. These actions use variable-length data, as indicated by the n_bytes argument, at random access positions, as indicated by the offset argument.

If permission is not granted, the call returns false. After one or more getspace calls, and optionally several read/write actions, the processor can decide that it has finished processing some part of the data space and issue a putspace call. This call advances the point of access by a certain number of bytes, i.e. n_bytes2 in FIG. 4 d. That number is limited by the previously granted space.
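The tape-and-window view of FIG. 4 can be sketched from the processor's side. The following Python code is a hedged model under assumptions. `AccessWindow` is hypothetical. `available` stands in for the space the shell would report, and new space arriving from the other side is not modelled. The tape is a sparse dict. Reads and writes at any offset are allowed inside the granted window, and putspace may advance at most by the granted amount.

```python
class AccessWindow:
    """Processor-side view of one port as an unbounded tape (toy model of FIG. 4)."""

    def __init__(self, available):
        self.available = available  # space the shell would report (assumed given)
        self.point = 0              # current point of access on the tape
        self.granted = 0            # size of the currently granted window
        self.tape = {}              # sparse stand-in for the unbounded tape

    def getspace(self, n_bytes):
        if n_bytes > self.available:
            return False            # permission not granted
        self.granted = max(self.granted, n_bytes)
        return True

    def _check(self, offset, n_bytes):
        if offset < 0 or offset + n_bytes > self.granted:
            raise PermissionError("access outside the granted window")

    def write(self, offset, data):
        self._check(offset, len(data))
        for i, b in enumerate(data):
            self.tape[self.point + offset + i] = b

    def read(self, offset, n_bytes):
        self._check(offset, n_bytes)
        return bytes(self.tape.get(self.point + offset + i, 0)
                     for i in range(n_bytes))

    def putspace(self, n_bytes):
        if n_bytes > self.granted:
            raise ValueError("putspace larger than the granted space")
        self.point += n_bytes       # advance the point of access
        self.granted -= n_bytes
        self.available -= n_bytes


w = AccessWindow(available=10)
assert not w.getspace(12)
assert w.getspace(8)
w.write(5, b"xyz")          # random-access positions inside the window
w.write(0, b"ab")
first = w.read(0, 2)
w.putspace(2)               # n_bytes2 = 2, within the 8 granted bytes
print(first, w.point, w.granted, w.read(3, 3))   # b'ab' 2 6 b'xyz'
```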

The general processing steps of the preferred embodiment shown in FIG. 2 can also be performed on the data processing system according to FIG. 3. The main difference is that in FIG. 3 the shells 22 of the respective processors 12 take over control of the communication between the processors and the memory.

Accordingly, FIG. 2 shows a flow chart of the principal processing of the processor 12 a, 12 b. In step S1 the processor issues the gettask call to the task scheduling unit in the shell 22 of said processor 12, in order to determine with which task it is supposed to continue. In step S2, the processor receives the information about the next task to be processed from its associated shell 22, or more precisely from the task scheduling unit of said shell 22. In step S3 the processor then checks the input streams belonging to that task, in order to decide whether sufficient data or other processing resources are available to perform the requested processing. This initial investigation may involve attempts to read some partial input and the decoding of packet headers. If step S4 determines that processing can continue because all necessary processing resources are at hand, the flow jumps to step S5 and the processor 12 continues with processing the current task. After the processor 12 has finished this processing in step S6, the flow jumps to the next processing step and the above-mentioned steps are repeated.

However, step S4 may determine that the processor 12 cannot continue with the current task. In that case it cannot complete the current processing step because processing resources are insufficient, for example because data is lacking in one of the input streams. The flow is then forwarded to step S7, and all results of the partial processing done so far are discarded without any state saving. In other words, none of the partial results of this processing step are saved. The partial processing may include some getspace calls, data read operations, or some processing on the acquired data. In step S8 the flow is then directed to restart and fully re-do the unfinished processing step at a later stage. However, the current task can only be abandoned and its partial results discarded as long as it has not committed any of its stream actions by sending the synchronisation message.

FIG. 5 illustrates the cyclic FIFO memory. Communicating a stream of data requires a FIFO buffer, which preferably has a finite and constant size. The buffer is preferably pre-allocated in memory, and a cyclic addressing mechanism provides proper FIFO behaviour within the linear memory address range.
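Cyclic addressing within a linear address range can be illustrated by mapping a window access onto one or two linear segments. This is a Python sketch under assumptions and not the shell's actual address logic. `cyclic_segments` is hypothetical. `point` is taken as an offset inside the buffer. An access that runs past the end of the buffer is split, and the remainder continues at the base address.

```python
def cyclic_segments(base, size, point, offset, n_bytes):
    """Map a window access onto linear memory for a cyclic FIFO (sketch).

    `point` is the access point as an offset within the buffer
    (0 <= point < size); the access starts `offset` bytes ahead of it.
    Returns one or two (address, length) pairs, two when the access wraps.
    """
    if not 0 <= n_bytes <= size:
        raise ValueError("access larger than the buffer")
    start = (point + offset) % size
    first = min(n_bytes, size - start)       # bytes before the buffer end
    segments = [(base + start, first)]
    if n_bytes > first:
        segments.append((base, n_bytes - first))  # wrap to the buffer start
    return segments


# A 256-byte FIFO at 0x1000, access point near the end of the buffer.
print(cyclic_segments(0x1000, 256, 250, 2, 10))  # [(4348, 4), (4096, 6)]
```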

A rotation arrow 50 in the centre of FIG. 5 shows the direction in which getspace calls from the processor confirm the granted window for read/write. Putspace calls move the access points ahead in the same direction. The small arrows 51, 52 denote the current access points of tasks A and B. In this example A is a writer and hence leaves valid data behind, whereas B is a reader and leaves empty space (or meaningless rubbish) behind. The shaded regions (A1, B1) ahead of each access point denote the access windows acquired through getspace operations.

Tasks A and B may proceed at different speeds, and/or may not be serviced for some periods of time because of multitasking. The shells 22 a, 22 b give the processors 12 a, 12 b on which A and B run the information needed to ensure that the access points of A and B keep their respective ordering. More strictly, they ensure that the granted access windows never overlap. The processors 12 a, 12 b are responsible for using the information provided by the shells 22 a, 22 b so that overall functional correctness is achieved. For example, the shell 22 a, 22 b may sometimes answer a getspace request from the processor with false, e.g. because too little space is available in the buffer. The processor should then refrain from accessing the buffer as requested.

The shells 22 a, 22 b are distributed, so that each can be implemented close to the processor 12 a, 12 b with which it is associated. Each shell 22 a, 22 b locally holds the configuration data for the streams incident with the tasks mapped on its processor. It also locally implements all the control logic needed to handle this data properly. Accordingly, a local stream table is implemented in the shells 22 a, 22 b, containing a row of fields for each stream, or in other words, for each access point.

To handle the arrangement of FIG. 5, the stream tables of the processor shells 22 a, 22 b of tasks A and B each contain one such line. The line holds a ‘space’ field with a (possibly pessimistic) distance from the shell's own point of access to the other point of access in this buffer. It also holds an ID denoting the remote shell, together with the task and port of the other point of access in this buffer. Additionally, said local stream table may contain a memory address corresponding to the current point of access, and the coding for the buffer base address and the buffer size in order to support cyclic address increments.

These stream tables are preferably memory-mapped in small memories, such as register files, in each of said shells 22. A getspace call can therefore be answered immediately and locally by comparing the requested size with the locally stored available space. Upon a putspace call, this local space field is decremented by the indicated amount. A putspace message is also sent to the other shell, which holds the previous point of access, to increment its space value. Correspondingly, when the shell 22 receives such a putspace message from a remote source, it increments its local field. Because messages take time to travel between shells, the two space fields may momentarily sum to less than the entire buffer size, i.e. hold pessimistic values. This does not violate synchronisation safety. In exceptional circumstances, several messages may even be on their way at the same time and be serviced out of order. Even then the synchronisation remains correct.
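Why pessimistic values stay safe can be checked with a small randomised simulation. This is a hedged Python model, not the shell hardware. `Shell`, `on_putspace_message` and the list used as the network are hypothetical. Messages are delivered after random delays and in random order. The sketch asserts that the two local space values never sum to more than the buffer size. After every message is delivered, they sum to exactly the buffer size.

```python
import random


class Shell:
    """One shell's stream-table row for a single access point (toy model)."""

    def __init__(self, space):
        self.space = space                 # possibly pessimistic distance

    def getspace(self, n_bytes):
        return n_bytes <= self.space       # answered locally

    def putspace(self, n_bytes, network, remote):
        assert n_bytes <= self.space
        self.space -= n_bytes
        network.append((remote, n_bytes))  # message now in flight

    def on_putspace_message(self, n_bytes):
        self.space += n_bytes


SIZE = 64
writer, reader = Shell(SIZE), Shell(0)
network = []
rng = random.Random(1)

for _ in range(1000):
    if rng.random() < 0.5 and writer.getspace(8):
        writer.putspace(8, network, reader)
    elif reader.getspace(4):
        reader.putspace(4, network, writer)
    # Deliver a random in-flight message: arbitrary delay and reordering.
    if network and rng.random() < 0.3:
        dest, n = network.pop(rng.randrange(len(network)))
        dest.on_putspace_message(n)
    # Safety: local views never claim more than the whole buffer.
    assert writer.space + reader.space <= SIZE

while network:                             # drain the remaining messages
    dest, n = network.pop()
    dest.on_putspace_message(n)
print(writer.space + reader.space)         # 64
```

The invariant holds because every byte is always counted exactly once: in the writer's field, in the reader's field, or in a message still in flight. Increments commute, so the order of delivery does not matter.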

FIG. 6 shows a mechanism for updating local space values in each shell and sending ‘putspace’ messages. In this arrangement, the associated shell 22 a, 22 b can answer a getspace request, i.e. a getspace call, from the processor 12 a, 12 b immediately and locally. It does so by comparing the requested size with the locally stored space information. Upon a putspace call, the local shell 22 a, 22 b decrements its space field by the indicated amount and sends a putspace message to the remote shell. The remote shell, i.e. the shell of another processor, holds the other point of access and increments its space value. Correspondingly, the local shell increments its space field when it receives such a putspace message from a remote source.

The space field belonging to a point of access is modified by two sources: it is decremented upon local putspace calls and incremented upon received putspace messages. If such an increment or decrement is not implemented as an atomic operation, the result could be wrong. In that case, separate local-space and remote-space fields might be used, each updated by a single source only. Upon a local getspace call, these values are then subtracted. Each shell 22 always controls the updates of its own local table and performs them atomically. This is purely a shell implementation issue and is not visible in its external functionality.
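The split-field variant can be sketched in a few lines. In this hedged Python model, `SplitSpace` is hypothetical and the 16-bit counter width is an assumption. Only the local putspace path writes `local_dec`, and only the message-receive path writes `remote_inc`, so neither update races with the other. The space is computed as a modular difference at getspace time. This stays correct after the counters wrap, as long as the true space fits in the counter width.

```python
MASK = (1 << 16) - 1   # assumed width of the hardware counters (they wrap)


class SplitSpace:
    """Space bookkeeping with one writer per field (sketch)."""

    def __init__(self, initial):
        self.initial = initial
        self.local_dec = 0    # written only by local putspace calls
        self.remote_inc = 0   # written only by received putspace messages

    def space(self):
        # Modular difference is correct across wrap-around while space <= MASK.
        return (self.initial + self.remote_inc - self.local_dec) & MASK

    def getspace(self, n_bytes):
        return n_bytes <= self.space()

    def putspace(self, n_bytes):
        assert n_bytes <= self.space()
        self.local_dec = (self.local_dec + n_bytes) & MASK

    def on_putspace_message(self, n_bytes):
        self.remote_inc = (self.remote_inc + n_bytes) & MASK


s = SplitSpace(initial=0)       # a reader's access point, buffer initially empty
for _ in range(20000):          # long run: both counters wrap past 16 bits
    s.on_putspace_message(8)    # writer committed 8 bytes
    assert s.getspace(8)
    s.putspace(8)               # reader consumed them
s.on_putspace_message(5)
print(s.space())                # 5
```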

If a getspace call returns false, the processor is free to decide how to react. Possible reactions are: a) the processor may issue a new getspace call with a smaller n_bytes argument, b) the processor might wait for a moment and then try again, or c) the processor might quit the current task and allow another task on this processor to proceed.
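The three reactions can be combined into one illustrative policy. The Python sketch below is hypothetical throughout. `acquire`, the halving strategy, the retry count and the `wait` callback are assumptions, not behaviour prescribed by the shell. A return value of 0 means the caller should yield the processor to another task.

```python
def acquire(getspace, wanted, minimum, retries=2, wait=lambda: None):
    """Hypothetical policy for a processor whose getspace call returned False.

    Tries the full request, then (a) halves it down to `minimum`,
    (b) waits and retries a few times, and finally (c) gives up so the
    caller can switch tasks. Returns the granted size, or 0 to yield.
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1 byte")
    for _ in range(retries + 1):
        n = wanted
        while n >= minimum:
            if getspace(n):
                return n
            n //= 2          # (a) ask for less
        wait()               # (b) wait a moment, then try again
    return 0                 # (c) quit the task for now


state = {"avail": 3}


def getspace(n):
    return n <= state["avail"]


def wait():
    state["avail"] += 2      # pretend more data arrives while waiting


granted = acquire(getspace, 16, 4, wait=wait)
print(granted)               # 4
```

For dedicated hardware, such a policy would be fixed at design time, as the following paragraph notes.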

This allows the task-switching decision to depend on the expected arrival time of more data and on the amount of internally accumulated state, with its associated state-saving cost. For non-programmable dedicated hardware processors, this decision is part of the architectural design process.

The implementation and operation of the shells 22 do not differentiate between read and write ports, although particular instantiations may do so. The operations implemented by the shells 22 effectively hide implementation aspects such as the size of the FIFO buffer, its location in memory, any address wrap-around mechanism for memory-bound cyclic FIFOs, caching strategies, cache coherency, global I/O alignment restrictions, data bus width, memory alignment restrictions, communication network structure and memory organisation.

Preferably, the shells 22 a, 22 b operate on unformatted sequences of bytes. There is no need for any correlation between the synchronisation packet sizes used by the writer and the reader which communicate the stream of data. A semantic interpretation of the data contents is left to the processor. The task is not aware of the application graph incidence structure, such as which other tasks it is communicating with, on which processors these tasks are mapped, or which other tasks are mapped on the same processor.
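To illustrate that writer and reader packet sizes need not correlate, the following hypothetical Python model lets a writer commit 3-byte packets while a reader consumes 4-byte packets over one cyclic byte buffer. The class and function names are illustrative only, and the capacity check in `transfer` is an assumption added so that this single-threaded interleaving can never deadlock.

```python
class ByteStream:
    """Hypothetical single-writer/single-reader stream of unformatted bytes."""

    def __init__(self, capacity: int) -> None:
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.written = 0   # total bytes committed by the writer
        self.released = 0  # total bytes committed (freed) by the reader

    def writer_getspace(self, n_bytes: int) -> bool:
        return n_bytes <= self.capacity - (self.written - self.released)

    def reader_getspace(self, n_bytes: int) -> bool:
        return n_bytes <= self.written - self.released

    def write_and_commit(self, data: bytes) -> None:
        if not self.writer_getspace(len(data)):
            raise ValueError("not enough empty space")
        for i, b in enumerate(data):
            self.buf[(self.written + i) % self.capacity] = b
        self.written += len(data)  # putspace towards the reader

    def read_and_commit(self, n_bytes: int) -> bytes:
        if not self.reader_getspace(n_bytes):
            raise ValueError("not enough valid data")
        start = self.released
        out = bytes(self.buf[(start + i) % self.capacity] for i in range(n_bytes))
        self.released += n_bytes  # putspace back to the writer
        return out


def transfer(payload: bytes, write_size: int, read_size: int, capacity: int) -> bytes:
    """Interleave a writer and a reader that use different packet sizes."""
    if capacity < write_size + read_size:
        raise ValueError("capacity too small for this single-threaded model")
    stream = ByteStream(capacity)
    out = bytearray()
    pos = 0
    while len(out) < len(payload):
        chunk = payload[pos:pos + write_size]
        if chunk and stream.writer_getspace(len(chunk)):
            stream.write_and_commit(chunk)
            pos += len(chunk)
        n = min(read_size, len(payload) - len(out))
        if stream.reader_getspace(n):
            out += stream.read_and_commit(n)
    return bytes(out)
```

The reader never sees packet boundaries; it only sees a byte count, which is why the interpretation of the data is left entirely to the processor.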

In high-performance implementations of the shells 22, the read call, write call, getspace call and putspace call can be issued in parallel via the read/write unit and the synchronisation unit of the shells 22 a, 22 b. Calls acting on different ports of the shells 22 do not have any mutual ordering constraint, while calls acting on identical ports of the shells 22 must be ordered according to the calling task or processor. For such cases, the next call from the processor can be launched when the previous call has returned: in a software implementation by returning from the function call, and in a hardware implementation by providing an acknowledgement signal.
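The ordering rule can be modelled as follows. In this hypothetical sketch, calls are queued per port, each call on a port is launched only after the previous call on that port has returned, and different ports are served round-robin as a deterministic stand-in for truly parallel execution. The names `issue_calls` and `Call` are illustrative.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

# (port_ID, operation name, callable performing the operation)
Call = Tuple[int, str, Callable[[], object]]


def issue_calls(calls: List[Call]) -> List[Tuple[int, str]]:
    """Execute calls with per-port ordering and no cross-port ordering.

    Returns the order in which calls completed as (port_ID, name) pairs.
    """
    queues: Dict[int, Deque[Call]] = defaultdict(deque)
    for call in calls:
        queues[call[0]].append(call)  # keep issue order within a port

    completed: List[Tuple[int, str]] = []
    while any(queues.values()):
        # One call per port per round; ports do not wait for each other.
        for port_id in list(queues):
            if queues[port_id]:
                _, name, operation = queues[port_id].popleft()
                operation()  # returns (acknowledges) before the next call on this port
                completed.append((port_id, name))
    return completed
```

A getspace on port 1 may complete before an earlier-issued read on port 0, but the getspace, read and putspace on port 0 always complete in issue order.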

A zero value of the size argument, i.e. n_bytes, in the read call can be reserved for pre-fetching data from the memory into the shell's cache at the location indicated by the port_ID and offset arguments. Such an operation can be used for automatic pre-fetching performed by the shell. Likewise, a zero value in the write call can be reserved for a cache flush request, although automatic cache flushing is a shell responsibility.
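The reserved zero-size encodings can be sketched as a hypothetical Python model of a shell-side cache. The 4-byte line size, the per-port base addresses, the write-back policy, and passing the data bytes to `write` (so that an empty payload plays the role of n_bytes = 0) are all assumptions; FIFO wrap-around is omitted for brevity.

```python
from typing import Dict, Set

LINE_SIZE = 4  # hypothetical cache-line size in bytes


class ShellCache:
    """Hypothetical write-back cache in front of the shared memory."""

    def __init__(self, memory: bytearray, port_base: Dict[int, int]) -> None:
        if len(memory) % LINE_SIZE:
            raise ValueError("memory size must be a multiple of LINE_SIZE")
        self.memory = memory
        self.port_base = port_base  # port_ID -> buffer base address
        self.lines: Dict[int, bytearray] = {}  # line address -> cached copy
        self.dirty: Set[int] = set()

    def _line(self, addr: int) -> bytearray:
        base = addr - addr % LINE_SIZE
        if base not in self.lines:  # cache miss: fetch the line from memory
            self.lines[base] = bytearray(self.memory[base:base + LINE_SIZE])
        return self.lines[base]

    def read(self, port_id: int, offset: int, n_bytes: int) -> bytes:
        addr = self.port_base[port_id] + offset
        if n_bytes == 0:
            self._line(addr)  # reserved meaning: pre-fetch only, no data returned
            return b""
        return bytes(self._line(a)[a % LINE_SIZE] for a in range(addr, addr + n_bytes))

    def write(self, port_id: int, offset: int, data: bytes) -> None:
        if not data:
            self.flush()  # reserved meaning: explicit cache flush request
            return
        addr = self.port_base[port_id] + offset
        for i, value in enumerate(data):
            a = addr + i
            self._line(a)[a % LINE_SIZE] = value
            self.dirty.add(a - a % LINE_SIZE)

    def flush(self) -> None:
        # Write every modified line back to the shared memory.
        for base in sorted(self.dirty):
            self.memory[base:base + LINE_SIZE] = self.lines[base]
        self.dirty.clear()
```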

Optionally, all five operations accept an additional last task_ID argument. This is normally the small positive number obtained as the result value of an earlier gettask call. The zero value for this argument is reserved for calls which are not task specific but relate to processor control.

In the preferred embodiment, the set-up for communicating a data stream is a stream with one writer and one reader connected to a finite-size FIFO buffer. Such a stream requires a FIFO buffer which has a finite and constant size. It is pre-allocated in memory, and a cyclic addressing mechanism is applied within its linear address range for proper FIFO behaviour.
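Cyclic addressing within the pre-allocated linear range can be sketched as follows. The function name `cyclic_segments` and the parameters `base`, `size` and `position` are hypothetical; the point is that a logical stream position is reduced modulo the buffer size, and an access that crosses the end of the range is split into two contiguous pieces, a detail the shell hides from the task.

```python
from typing import List, Tuple


def cyclic_segments(base: int, size: int, position: int,
                    n_bytes: int) -> List[Tuple[int, int]]:
    """Map an access at logical stream `position` onto a cyclic buffer.

    The buffer occupies the linear range [base, base + size). Returns
    (address, length) pieces; an access no larger than the buffer wraps
    around at most once, so at most two pieces are produced.
    """
    if not 0 <= n_bytes <= size:
        raise ValueError("access must fit within the FIFO buffer")
    if n_bytes == 0:
        return []
    start = position % size
    first = min(n_bytes, size - start)  # bytes before the end of the range
    segments = [(base + start, first)]
    if n_bytes > first:
        segments.append((base, n_bytes - first))  # wrapped remainder
    return segments
```

For example, a 10-byte access at position 250 of a 256-byte buffer yields 6 bytes at the end of the range followed by 4 bytes at its start.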

However, in a further embodiment based on FIG. 3 and FIG. 7, the data stream produced by one task is to be consumed by two or more different consumers having different input ports. Such a situation can be described by the term forking. However, we want to re-use the task implementations both for multi-tasking hardware processors and for software tasks running on the CPU. This is achieved by giving tasks a fixed number of ports, corresponding to their basic functionality, and by having the shell resolve any need for forking induced by the application configuration.

Clearly, stream forking could be implemented by the shells 22 simply by maintaining two separate normal stream buffers, by doubling all write and putspace operations and by performing an AND operation on the result values of the doubled getspace checks. Preferably, this is not done, as the costs would include double the write bandwidth and probably more buffer space. Instead, the implementation is preferably made with two or more readers and one writer sharing the same FIFO buffer.

FIG. 7 shows an illustration of the FIFO buffer with a single writer and multiple readers. The synchronisation mechanism must ensure a normal pairwise ordering between A and B next to a pairwise ordering between A and C, while B and C have no mutual constraints, e.g. assuming they are pure readers. This is accomplished in the shell associated with the processor performing the writing operation by keeping track of the available space separately for each reader (A to B and A to C). When the writer performs a local getspace call, its n_bytes argument is compared with each of these space values. This is implemented by using extra lines in said stream table for forking, connected by one extra field or column to indicate changing to a next line.

This provides very little overhead for the majority of cases where forking is not used, and at the same time does not limit forking to two-way only. Preferably, forking is implemented only by the writer, and the readers are not aware of this situation.
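The writer-side bookkeeping for a forked stream can be sketched as follows. This is a hypothetical Python model: the dictionary keyed by reader stands in for the extra stream-table lines linked by a next-line field, and the method names are illustrative. The readers themselves keep using ordinary two-station logic.

```python
from typing import Dict, List


class ForkedWriterShell:
    """Hypothetical writer-side bookkeeping for one FIFO shared by several readers."""

    def __init__(self, buffer_size: int, readers: List[str]) -> None:
        # Separate space value per writer->reader pair (A->B, A->C, ...).
        self.space: Dict[str, int] = {r: buffer_size for r in readers}

    def getspace(self, n_bytes: int) -> bool:
        # n_bytes is compared with every per-reader space value.
        return all(n_bytes <= s for s in self.space.values())

    def putspace(self, n_bytes: int) -> List[str]:
        if not self.getspace(n_bytes):
            raise ValueError("putspace exceeds space for some reader")
        for reader in self.space:
            self.space[reader] -= n_bytes
        # Every reader must be sent a putspace message.
        return list(self.space)

    def receive_putspace(self, reader: str, n_bytes: int) -> None:
        # Only the releasing reader's space value grows; B and C stay independent.
        self.space[reader] += n_bytes
```

Here the writer stays blocked by the slowest reader, while a fast reader (B) never waits for a slow one (C).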

In a further embodiment based on FIG. 3 and FIG. 8, the data stream is realised as a three-station stream according to the tape model. Each station performs some updates of the data stream which passes by. An example of the application of the three-station stream is one writer, an intermediate watchdog and the final reader. In such an example, the second task preferably watches the data that passes and may inspect some of it, while mostly allowing the data to pass without modification. Relatively infrequently, it could decide to change a few items or data objects in the stream. This can be achieved efficiently by in-place buffer updates by a processor, which avoid copying the entire stream contents from one buffer to another. In practice this might be useful when hardware processors 12 communicate and the main CPU 11 intervenes to modify the stream to correct hardware flaws, to adapt to slightly different stream formats, or just for debugging reasons. Such a set-up could be achieved with all three processors sharing the single stream buffer in memory, to reduce memory traffic and processor workload. The task B will not actually read or write the full data stream.

FIG. 8 depicts a finite memory buffer implementation for a three-station stream. The proper semantics of this three-way buffer include maintaining a strict ordering of A, B and C with respect to each other and ensuring no overlapping windows. In this way the three-way buffer is an extension of the two-way buffer shown in FIG. 5. Such a multi-way cyclic FIFO is directly supported by the operations of the shells as described above, as well as by the distributed implementation style with putspace messages as discussed in the preferred embodiment. There is no limitation to just three stations in a single FIFO. In-place processing, where one station both consumes and produces useful data, is also applicable with only two stations. In this case both tasks perform in-place processing to exchange data with each other, and no empty space is left in the buffer.
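The multi-way cyclic FIFO can be sketched as a generalisation of the two-station bookkeeping. In this hypothetical Python model (class and method names are illustrative), station i may only claim bytes that station i - 1 has committed, and the last station releases space back to the first. The per-station windows therefore tile the buffer in the order A, B, C without overlapping, and the space values always sum to the buffer size.

```python
from typing import List, Tuple


class MultiStationFifo:
    """Hypothetical N-station cyclic FIFO following the tape model.

    Station 0 (A) writes, middle stations (B, ...) update in place, and
    the last station frees space back to A.
    """

    def __init__(self, size: int, stations: int) -> None:
        if stations < 2:
            raise ValueError("need at least two stations")
        self.size = size
        self.pos: List[int] = [0] * stations  # point of access per station
        self.space: List[int] = [size] + [0] * (stations - 1)

    def getspace(self, station: int, n_bytes: int) -> bool:
        return n_bytes <= self.space[station]

    def putspace(self, station: int, n_bytes: int) -> None:
        if not self.getspace(station, n_bytes):
            raise ValueError("request exceeds this station's window")
        self.space[station] -= n_bytes
        self.pos[station] = (self.pos[station] + n_bytes) % self.size
        # Stand-in for a putspace message to the next station on the ring.
        self.space[(station + 1) % len(self.space)] += n_bytes

    def window(self, station: int) -> Tuple[int, int]:
        """(start offset, length) of the region this station may access."""
        return self.pos[station], self.space[station]
```

With two stations and both performing in-place processing, the same class models the case in which no empty space is left in the buffer.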

In another embodiment based on the preferred embodiment of FIG. 2, the idea of the logical separation of read/write operations and synchronisation operations is implemented as a physical separation of the data transport, i.e. the read and write operations, and the synchronisation. Preferably, a wide bus allowing high bandwidth for the transport, i.e. the read/write operations of data, is implemented. A separate communication network is implemented for the synchronisation operations, since it did not appear preferable to use the same wide bus for synchronisation. This arrangement has the advantage that both networks can be optimised for their respective use. Accordingly, the data transport network is optimised for memory I/O, i.e. the reading and writing operations, and the synchronisation network is optimised for inter-processor messages.

The synchronisation network is preferably implemented as a message-passing ring network, which is especially tuned and optimised for this purpose. Such a ring network is small and very scalable, supporting the flexibility requirement of a scalable architecture. The higher latency of the ring network does not influence performance negatively, as the synchronisation delays are absorbed by the data stream buffers and memories. The total throughput of the ring network is quite high: each link in the ring can pass a synchronisation message simultaneously, allowing as many messages in flight as there are processors.
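The throughput claim can be checked with a small hop-by-hop simulation. This is a hypothetical model (the function name, the one-message-per-node injection and the single-step link timing are assumptions): every link of a unidirectional ring moves one message per step, all at the same time, and a message leaves the ring at its destination. With four processors, four messages can be in flight at once.

```python
from typing import List, Optional, Tuple


def _count(held: List[Optional[int]]) -> int:
    return sum(d is not None for d in held)


def simulate_ring(n_nodes: int, messages: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Hop-by-hop model of a unidirectional synchronisation ring.

    `messages` are (source, destination) pairs injected at step 0, at most
    one per source node. On every step, each link i -> (i + 1) % n_nodes
    moves one message at the same time; a message leaves the ring at its
    destination. Returns (steps until all are delivered, peak in flight).
    """
    held: List[Optional[int]] = [None] * n_nodes  # destination held per node
    for src, dst in messages:
        if held[src] is not None:
            raise ValueError("at most one message per source node")
        held[src] = dst

    steps, peak = 0, _count(held)
    while _count(held):
        moved: List[Optional[int]] = [None] * n_nodes
        for node, dst in enumerate(held):
            if dst is not None:
                nxt = (node + 1) % n_nodes
                moved[nxt] = None if nxt == dst else dst  # delivered at destination
        held = moved
        steps += 1
        peak = max(peak, _count(held))
    return steps, peak
```

Latency grows with the hop distance, but since all links are active simultaneously, the aggregate rate scales with the number of processors.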

In still another embodiment based on FIG. 3, the idea of the physical separation of data transport and synchronisation is realised. The synchronisation units in the shell 22 a are connected to other synchronisation units in another shell 22 b. The synchronisation units ensure that one processor does not access memory locations before valid data for a processed stream has been written to these memory locations. Similarly, the synchronisation interface is used to ensure that the processor 12 a does not overwrite useful data in memory 32. The synchronisation units communicate via a synchronisation message network. Preferably, they form part of a ring, in which synchronisation signals are passed from one processor to the next, or blocked and overwritten when these signals are not needed at any subsequent processor. The synchronisation units together form a synchronisation channel. The synchronisation units maintain information about the memory space which is used for transferring the stream of data objects from processor 12 a to processor 12 b.

1. Method for processing data in a data processing system, said system comprising a first and at least a second processor for processing streams of data objects, said first processor being arranged to pass data objects from a stream of data objects to the second processor; and at least one memory for storing and retrieving data objects, wherein a shared access for said first and said second processors is provided, said method comprising the steps of: said processors perform read operations and/or write operations to exchange data objects with said memory; and said processors perform inquiry operations and/or commit operations to synchronise data object transfers between tasks which are executed by said processors; wherein said inquiry operations and said commit operations are performed by said processors independently of said read operations and said write operations.
2. Method according to claim 1, wherein said inquiry operations are executed by one of said second processors to request the right to access a group of data objects in said memory, wherein said group of data objects is produced or consumed in said memory by a series of read/write operations by said processors; and said commit operations are executed by one of said second processors to transfer the right to access said group of data objects to another of said second processors.
3. Method according to claim 1, wherein said memory is a FIFO buffer, and said inquiry and commit operations are used to control the FIFO behaviour of said memory buffer in order to transport streams of data objects between said first and second processors through said shared memory buffer.
4. Method according to claim 1, wherein a third processor receives the right to access a group of data objects from said first processor, performs read and/or write operations on said group of data objects, and transfers the right of access to said second processor, without copying said group of data objects to another location in said shared memory.
5. Method according to claim 1, wherein said second processor is a multi-tasking processor, capable of interleaved processing of at least a first and a second task, wherein said at least first and second tasks process streams of data objects.
6. Method according to claim 1, wherein said second processors are function-specific dedicated processors for performing a range of stream processing tasks.
7. Method according to claim 2, wherein said read/write operations enable said second processors to randomly access locations within one of said groups of data objects in said memory.
8. Method according to claim 1, wherein the further processing of a group of data objects of a first task is temporarily prevented when the processing of said group of data objects is interrupted, wherein processing of data objects of a second task is carried out when processing of said group of data elements of said first task is interrupted.
9. Method according to claim 8, wherein after the interruption of the task the actual task state of the partial processing of the group of data objects is discarded and a commit operation of the partial group of data objects is prevented.
10. Method according to claim 7, wherein after resumption of the interrupted task, the processor restarts the processing of the group of data objects, whereby previous processing on said group is discarded.
11. A data processing system, comprising: a first and at least a second processor for processing streams of data objects, said first processor being arranged to pass data objects from a stream of data objects to the second processor; and at least one memory for storing and retrieving data objects, wherein a shared access for said first and said second processors is provided, said processors being adapted to perform read operations and/or write operations to exchange data objects with said memory; and said processors being adapted to perform inquiry operations and/or commit operations to synchronise data object transfers between tasks which are executed by said processors; wherein said processors are adapted to perform said inquiry operations and said commit operations independently of said read operations and said write operations.
12. A data processing system according to claim 11, wherein said second processor is adapted to execute said inquiry operations to request the right to access a group of data objects in said memory, wherein said group of data objects is produced or consumed in said memory by a series of read/write operations by said processors; and said second processor is adapted to execute said commit operations to transfer the right to access said group of data objects to another of said second processors.
13. A data processing system according to claim 11, wherein said memory is a FIFO buffer, and said processors are adapted to perform said inquiry and commit operations in order to control the FIFO behaviour of said memory buffer for transporting streams of data objects between said first and second processors through said shared memory buffer.
14. A data processing system according to claim 11, comprising a third processor adapted to receive the right to access a group of data objects from said first processor, to perform read and/or write operations on said group of data objects, and to transfer the right of access to said second processor, without copying said group of data objects to another location in said shared memory.
15. Data processing system according to claim 11, wherein said second processor is a multi-tasking processor, capable of interleaved processing of at least a first and a second task, wherein said at least first and second tasks process streams of data objects.
16. Data processing system according to claim 11, wherein said second processors are function-specific dedicated processors for performing a range of stream processing tasks.
17. Data processing system according to claim 12, wherein said second processors are adapted to perform read and/or write operations enabling random access to locations within one of said groups of data objects in said memory.
18. Data processing system according to claim 11, wherein the further processing of a group of data objects of a first task is temporarily prevented when the processing of said group of data objects is interrupted, wherein processing of data objects of a second task is carried out when processing of said group of data elements of said first task is interrupted.
19. Data processing system according to claim 18, wherein after the interruption of the task the actual task state of the partial processing of the group of data objects is discarded and a commit operation of the partial group of data objects is prevented.
20. Data processing system according to claim 19, wherein after resumption of the interrupted task, the processor restarts the processing of the group of data objects, whereby previous processing on said group is discarded.